Parallel dynamics of bacterial genome reduction across independent transitions to endosymbiosis.



authors:

  • Giobbe Forni
  • Jacopo Martelossi
  • Dario Pistone
  • Benoit Morel
  • Claudio Bandi
  • Matteo Montagna

date: “12 Giugno 2024”



0 - BACKGROUND:

Abstract: The establishment of symbiosis dramatically alters the evolution of the associated species, making symbiotic systems ideal models for studying the impact of lifestyle changes on genomes. Here, we focused on Enterobacterales, a large and ancient bacterial lineage that includes endosymbionts with diverse host associations, ranging from gut inhabitants to intracellular environments, and from horizontal to vertical transmission. Leveraging over two hundred genomes, along with cutting-edge single-copy gene concatenation and multi-copy gene family approaches, we inferred a robust phylogenetic framework that supports eleven independent transitions to endosymbiosis. Inferences on patterns of genome evolution confirm previous hypotheses about the processes underlying genome reduction: a substantial spike in gene loss always occurs simultaneously with the establishment of endosymbiosis, while a reduction in gene acquisition mechanisms is associated with the subsequent genome erosion. Furthermore, gene family loss frequencies were correlated across independent endosymbiotic clades; genes with more conserved functions and stronger constraints on sequence evolution are lost less frequently, suggesting that differences in gene essentiality and dispensability drive the observed parallelism. Our analyses contribute to the coming of age of the theory of genome evolution in symbiotic associations and provide novel insights into the importance of recombination as an opposing force against genome erosion.



1 - PHYLOGENETIC INFERENCES:



code for concatenation-based species tree inferences:
iqtree2 -s min95%sp_p.ol.all_parts.aln -m TESTMERGE --rcluster 10 -spp min95%sp_p.ol.all_parts.prt -mset JTT,WAG,LG -bb 1000 -mrate G,R,E -T 32 --prefix all_parts_site_homogeneous

iqtree2 -s min95%sp_p.ol.good_part.aln -m TESTMERGE --rcluster 10 -spp min95%sp_p.ol.good_part.prt -mset JTT,WAG,LG -bb 1000 -mrate G,R,E -T 32 --prefix good_part_site_homogeneous 

iqtree2 -s min95%sp_p.ol.all_parts.aln -m TESTMERGE -mset JTT,WAG,LG,LG4M,LG4X,CF4,C10,C20,C30,C40,C50,C60 -madd JTT+C10,JTT+C20,JTT+C30,JTT+C40,JTT+C50,JTT+C60,WAG+C10,WAG+C20,WAG+C30,WAG+C40,WAG+C50,WAG+C60,LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60 -bb 1000 -T 32 --prefix all_parts_site_heterogeneous

iqtree2 -s min95%sp_p.ol.good_part.aln -m TESTMERGE -mset JTT,WAG,LG,LG4M,LG4X,CF4,C10,C20,C30,C40,C50,C60 -madd JTT+C10,JTT+C20,JTT+C30,JTT+C40,JTT+C50,JTT+C60,WAG+C10,WAG+C20,WAG+C30,WAG+C40,WAG+C50,WAG+C60,LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60 -bb 1000 -T 32 --prefix good_part_site_heterogeneous

iqtree2 -s min95%sp_p.ol.RS4.all_parts.aln -m TESTMERGE -rcluster 10 -spp min95%sp_p.ol.RS4.all_parts.prt -mset JC,F81,GTR -bb 1000 -mrate G,R,E -T 16 --prefix SR4_all_parts_site_homogeneous

iqtree2 -s min95%sp_p.ol.RS4.good_part.aln -m TESTMERGE -rcluster 10 -spp min95%sp_p.ol.RS4.good_part.prt -mset JC,F81,GTR -bb 1000 -mrate G,R,E -T 16 --prefix SR4_good_part_site_homogeneous

iqtree2 -s min95%sp_p.ol.RS4.all_parts.aln -m TEST -mset JC,F81,GTR -mdef SR4_profileModels_IQtree.nxs -madd SR4C10JC,SR4C20JC,SR4C30JC,SR4C40JC,SR4C50JC,SR4C60JC,SR4C10F81,SR4C20F81,SR4C30F81,SR4C40F81,SR4C50F81,SR4C60F81,SR4C10GTR,SR4C20GTR,SR4C30GTR,SR4C40GTR,SR4C50GTR,SR4C60GTR -bb 1000 -mrate G,R,E -T 32 --prefix SR4_all_parts_site_heterogeneous

iqtree2 -s min95%sp_p.ol.RS4.good_part.aln -m TEST -mset JC,F81,GTR -mdef SR4_profileModels_IQtree.nxs -madd SR4C10JC,SR4C20JC,SR4C30JC,SR4C40JC,SR4C50JC,SR4C60JC,SR4C10F81,SR4C20F81,SR4C30F81,SR4C40F81,SR4C50F81,SR4C60F81,SR4C10GTR,SR4C20GTR,SR4C30GTR,SR4C40GTR,SR4C50GTR,SR4C60GTR -bb 1000 -mrate G,R,E -T 32 --prefix SR4_good_part_site_heterogeneous



code for gene families tree inferences:

We built up gene families trees for each OG using two different dataset and the ParGene pipeline: (1) aa sequences for all gene families, (2) SR4-recoded aa sequences for all gene families. the “\(Aln_dir" point to the directory were are stored original alignments and the recoded ones. "\)out” is the out directory.

pargenes.py -a "$Aln_dir" -o "$out" -c 16 -d aa -m --modeltest-criteria BIC --modeltest-perjob-cores 4



code for gene families-based species tree inferences:

We performed three gene families based tree inferences (1) all gene families (aa alignments and trees); (2) gene families that pass the iqtree stationary/homogeneity test (aa alignments and trees) and (3) all gene families but SR4-recoded (SR4-recoded alignments and trees). The GeneRax comand was the same for all analyses changing only the family file (“$Family_file” variable).

generax --families "$Family_file" --si-strategy HYBRID --species-tree MiniNJ --rec-model UndatedDTL --per-family-rates --prune-species-tree --si-estimate-bl --si-quartet-support --prefix "$out"



code for gene trees corrections and reconciliations:

The species trees resulting from SpeciesRax analyses were re-rooted to correct the root position and all gene trees were re-corrected for 3 rounds and reconciled infering per-species DTL rates

generax -f "$Family_file" -s "$tree" --per-species-rates --strategy SPR -p "$out" --max-spr-radius 3



R library requirements:
library(phytools)
library(phangorn)
library(ggtree)
library(ggplot2)
library(reshape2)
library(TreeDist)
library(ComplexHeatmap)
library(geiger)
library(ape)
library(gridExtra)
library(phylotools)
library(ggfortify)
library(ggExtra)
library(treeio)
library(tidyr)
library(tidytree)
library(plotly)
library(ggsignif)
library(ggpubr)
library(stringr)
library(forcats)
library(dplyr)
library(cowplot)
library(ggrepel)
library(extrafont)
library(corrplot)
library(rstatix)
library(ggcorrplot)
library(ggwordcloud)
library(patchwork)



2 - TOPOLOGY ANALYSES:

Here we compare all the Enterobacterales topologies we found.



import data pt. 1:

We start by importing the different phylogenetic inferences along with the tips tratis:

concat_all_hom_aa <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_homogeneous_aa.nwk")
concat_all_het_aa <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_heterogeneous_aa.nwk")

concat_hom_hom_aa <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_homogeneous_aa.nwk")
concat_hom_het_aa <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_heterogeneous_aa.nwk")

concat_all_hom_SR4 <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_homogeneous_SR4.nwk")
concat_all_het_SR4 <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_heterogeneous_SR4.nwk")

concat_hom_hom_SR4 <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_homogeneous_SR4.nwk")
concat_hom_het_SR4 <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_heterogeneous_SR4.nwk")

genfam_all_aa <- read.tree("phylogenetic_resolutions/SpeciesRax_all_genes_aa.nwk")
genfam_hom_aa <- read.tree("phylogenetic_resolutions/SpeciesRax_homogeneous_genes_aa.nwk")
genfam_all_SR4 <- read.tree("phylogenetic_resolutions/SpeciesRax_all_genes_SR4.nwk")

genfam_all_SR4 <- drop.tip(genfam_all_SR4, "NZ-LG025083.1")

annot <- read.csv("sp_annotations.csv", header=T)
sp_traits <- read.csv("sp_traits.csv", header=T)

row.names(annot) <- annot$species
row.names(sp_traits) <- sp_traits$species

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

info <- info[,-c(1,2)]



Fig. S2 - visualizing differences among inferences:

A more detailed way to take a look at the differences among our inferences by plotting them side by side.

pdf(file="FigS2.pdf", width=25, height=25)

ref_tree <- genfam_all_SR4
ref_tree <- drop.tip(ref_tree,c("NZ-CP028926.1","NZ-CP009610.1"))                               # drop outgroups
ref_tree <- reroot(ref_tree,which((ref_tree$tip.label == "NZ-CP050969.1") == "TRUE"))           # reroot on Plesiomonas shigelloides
ref_tree$tip.label <- gsub("\\.[0-9]", "",ref_tree$tip.label )
ref_tree$edge.length<-NULL

other_tree <- concat_all_het_SR4
other_tree <- drop.tip(other_tree,c("NZ-CP028926.1","NZ-CP009610.1"))                           # drop outgroups
other_tree <- reroot(other_tree,which((other_tree$tip.label == "NZ-CP050969.1") == "TRUE"))     # reroot on Plesiomonas shigelloides
other_tree$tip.label <- gsub("\\.[0-9]", "",other_tree$tip.label )
other_tree$edge.length<-NULL

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

info <- info[match(ref_tree$tip.label, info$species.y),]
ref_tree$tip.label <- info$sp

info <- info[match(other_tree$tip.label, info$species.y),]
other_tree$tip.label <- info$sp

vs<-cophylo(ref_tree,other_tree,rotate=T)
## Rotating nodes to optimize matching...
## Done.
plot(vs,link.lwd=0.5,link.lty="solid",
     link.col=make.transparent("darkgray ",0.75),fsize=0.5)
title(main="gene families / all gene families / SR4 recoding (11) VS concatenation / all single copy genes / mixture model / SR4 recoding (04)", line = -1, cex.main = 2)

nodelabels.cophylo(round(as.numeric(ref_tree$node.label),2), cex=0.5, which="left", frame = "none")
nodelabels.cophylo(as.numeric(other_tree$node.label), cex=0.5, which="right", frame = "none")

info <- merge(annot, sp_traits, by = "row.names")
info <- info[match(ref_tree$tip.label, info$sp),]
state <- setNames(as.factor(info$state),info$sp)
cols<-setNames(c("orange","#c0ecf5"),levels(state))
tiplabels.cophylo(pie=to.matrix(state[vs$trees[[1]]$tip.label],levels(state)),
    piecol=cols,cex=0.1,which="left")
info <- merge(annot, sp_traits, by = "row.names")
info <- info[match(other_tree$tip.label, info$sp),]
state <- setNames(as.factor(info$state),info$sp)
tiplabels.cophylo(pie=to.matrix(state[vs$trees[[2]]$tip.label],levels(state)),
    piecol=cols,cex=0.1,which="right")


dev.off() 
## quartz_off_screen 
##                 2



export trees for subsequent analyses:
tree <- genfam_all_SR4
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))             # reroot on Plesiomonas shigelloides
write.tree(tree, "./DLT_analyses_SpeciesRax_all_genes_SR4/DLT_analyses_SpeciesRax_all_genes_SR4_starting_tree.nwk")

tree <- concat_all_het_SR4
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))             # reroot on Plesiomonas shigelloides
write.tree(tree, "./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4_starting_tree.nwk")



3 - ANALYSES ON “BEST” TOPOLOGY:



D/L/T of gene families on the “gene families + all genes + SR4” tree:
generax --families "$Family_file" --si-strategy HYBRID --species-tree MiniNJ --rec-model UndatedDTL --per-family-rates --prune-species-tree --si-estimate-bl --si-quartet-support --prefix "$out"
n=1; while read line; do if echo $line | grep -q Node; then ((n++)); ref=$(echo $line | awk '{print $1}'); echo $ref s$n; fi; done < counts.txt > conversion.txt
cp DLT_tree.newick DLT_tree_conv.newick; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" DLT_tree_conv.newick; done < conversion.txt
cp counts.txt counts_conv.txt; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" counts_conv.txt; done < conversion.txt 
cp rates.txt rates_conv.txt; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" rates_conv.txt; done < conversion.txt
for file in *speciesEventCounts.txt; do while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" $file; done < conversion.txt; done



data wrangling pt. 1a:
node_annotation = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",
                             sep="\t",header=T)   #Path to node annotation

rates = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/rates_conv.txt",
                   sep=" ",header=F)    #Path to per_species_rates
colnames(rates) <- c("node","duplication_rate","loss_rate","transfer_rate")
counts = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/counts_conv.txt",
                    sep=" ",header=F)  #Path to per_species_counts
colnames(counts) <- c("node","S","SL","D","T","TL")
counts$S <- gsub("S=", "", counts$S)
counts$SL <- gsub("SL=", "", counts$SL)
counts$D <- gsub("D=", "", counts$D)
counts$T <- gsub("T=", "", counts$T)
counts$TL <- gsub("TL=", "", counts$TL)

tmp  <- left_join(node_annotation, rates, by = 'node')
data  <- left_join(tmp, counts, by = 'node')



Fig. S3 - contingent losses across different endosymbiont lineages:
timeline <-  read.table("DLT_analyses_SpeciesRax_all_genes_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

tmp <- timeline %>%
  mutate(name2=origin.x)

FigS3 <- tmp %>%
  ggplot( aes(x=coordinate, y=loss_rate)) +
  geom_line( data=tmp %>% dplyr::select(-origin.x), aes(group=name2), color="lightgray", size=0.55, alpha=0.5) +
  geom_line( aes(color=origin.x), color="orange", size=1 ) + 
  theme(
      legend.position="none",
      plot.title = element_text(size=14),
      panel.grid = element_blank()
    ) + theme_bw() + facet_wrap(~origin.x, nrow=3) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
    theme(strip.background =element_rect(fill="white")) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) +
  labs(x = "branch") + labs(y = "mean loss rate")


ggsave("FigS3.pdf", FigS3, width = 6, height = 6)

FigS3



Fig. 2b - contingent gene losses overall:
timeline <-  read.table("DLT_analyses_SpeciesRax_all_genes_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

median(subset(timeline, coordinate == 0)$loss_rate)
## [1] 0.5924475
sd(subset(timeline, coordinate == 0)$loss_rate)
## [1] 0.2099138
median(subset(timeline, coordinate != 0)$loss_rate)
## [1] 0.1847425
sd(subset(timeline, coordinate != 0)$loss_rate)
## [1] 0.1231039
median(subset(timeline, coordinate == 0)$loss_rate)/median(subset(timeline, coordinate != 0)$loss_rate)
## [1] 3.206883
sd(subset(timeline, coordinate == 0)$loss_rate)/mean(subset(timeline, coordinate != 0)$loss_rate)
## [1] 1.238111
tmp <- timeline %>%
  mutate(name2=origin.x)

Fig2b <- ggplot(data=tmp, aes(x=coordinate, y=loss_rate, colour=origin)) + 
  xlab("") + theme_bw() + theme(text = element_text(size = 10)) + 
  geom_smooth(xintercept=-1, linetype="dashed", colour = "orange", size=1, alpha=0.2, colour=timeline$origin) +
  theme(legend.position="none", plot.title = element_text(size=14)) + 
  theme_bw() + theme(axis.title.x=element_blank()) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
  theme(strip.background = element_rect(fill="white")) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + labs(x = "branch") + labs(y = "gene loss rate") + labs(x = "branch")



Fig. 2a + 3a + 3c - DTL rates on the phylogeny:
tree <-  read.tree("DLT_analyses_SpeciesRax_all_genes_SR4/DLT_tree_conv.newick") #Path to specie tree with node labels

tree <- drop.tip(tree,c("NZ-CP028926.1","NZ-CP009610.1"))           # drop outgroups
data <- subset(data, node != "NZ-CP028926.1")
data <- subset(data, node != "NZ-CP009610.1")
data <- subset(data, node != "s209")
data <- subset(data, node != "s2")

#Create two dataframes for internal nodes and tips and order them based on the tree
data_node = data[grepl('^s|^Root', data$node),]
toDrop = rownames(data_node)
data_tip = data[!(row.names(data) %in% toDrop),]
data_node = data_node %>% arrange(factor(node, levels = tree$node.label))
data_tip = data_tip %>% arrange(factor(node, levels = tree$tip.label))
#Add node number
data_node$node = 1:tree$Nnode+Ntip(tree)
data_tip$node = 1:Ntip(tree)
#Merge the two dataframes
df = rbind(data_node,data_tip)
#Merge phylogenetic tree and data
tree2 <- full_join(tree, df, by = 'node')
#write out a nexus file with all annotations
write.beast(tree2,"DLT_analyses_SpeciesRax_all_genes_SR4/Enterobacteriales_annotated_tree.nxs")
#read the nexus file
beast_tree=read.beast("DLT_analyses_SpeciesRax_all_genes_SR4/Enterobacteriales_annotated_tree.nxs")

#Plot the loss tree
Fig2a <- ggtree(beast_tree, aes(color=as.numeric(loss_rate)), branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("loss rates") +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")
 

#Plot the duplication tree
Fig3a <- ggtree(beast_tree, aes(color=as.numeric(duplication_rate)),branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("duplication rates") +
   geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")

#Plot the transfer tree
Fig3c <- ggtree(beast_tree, aes(color=as.numeric(transfer_rate)),branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("transfer rates") +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")



Fig. 2c + 3b + 3d - DTL rates differences between symbiont and free-living branches:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

state <- ""
levels(state) <- c("endosymbionts","free-living")
cols<-setNames(c("orange","#c0ecf5"),levels(state))

all_branches_events$state[all_branches_events$state == "E"] <- "endosymbionts"
all_branches_events$state[all_branches_events$state == "N"] <- "free-living"

shapiro.test(all_branches_events$loss_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$loss_rate
## W = 0.85423, p-value < 2.2e-16
p <- ks.test(all_branches_events$loss_rate[all_branches_events$state == "endosymbionts"],all_branches_events$loss_rate[all_branches_events$state == "free-living"],alternative="less")$p.value
Fig2c <- ggplot(all_branches_events, aes(y=loss_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("loss rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

median(subset(all_branches_events, state == "free-living")$loss_rate)
## [1] 0.08906095
sd(subset(all_branches_events, state == "free-living")$loss_rate)
## [1] 0.102837
median(subset(all_branches_events, state != "free-living")$loss_rate)
## [1] 0.131917
sd(subset(all_branches_events, state != "free-living")$loss_rate)
## [1] 0.1275177
shapiro.test(all_branches_events$duplication_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$duplication_rate
## W = 0.25542, p-value < 2.2e-16
p <- ks.test(all_branches_events$duplication_rate[all_branches_events$state == "endosymbionts"],all_branches_events$duplication_rate[all_branches_events$state == "free-living"],alternative="greater")$p.value
Fig3b <- ggplot(all_branches_events, aes(y=duplication_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("duplication rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates") + theme(plot.title = element_text(size = 10)) + coord_cartesian(ylim=c(0, .05))

median(subset(all_branches_events, state == "free-living")$duplication_rate)
## [1] 0.003766365
sd(subset(all_branches_events, state == "free-living")$duplication_rate)
## [1] 0.1346826
median(subset(all_branches_events, state != "free-living")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, state != "free-living")$duplication_rate)
## [1] 0.1505295
shapiro.test(all_branches_events$transfer_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$transfer_rate
## W = 0.84092, p-value < 2.2e-16
p <- ks.test(all_branches_events$transfer_rate[all_branches_events$state == "endosymbionts"],all_branches_events$transfer_rate[all_branches_events$state == "free-living"],alternative="greater")$p.value
Fig3d <- ggplot(all_branches_events, aes(y=transfer_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("transfer rates p= ",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

median(subset(all_branches_events, state == "free-living")$transfer_rate)
## [1] 0.09097955
sd(subset(all_branches_events, state == "free-living")$transfer_rate)
## [1] 0.1283512
median(subset(all_branches_events, state != "free-living")$transfer_rate)
## [1] 0.0817777
sd(subset(all_branches_events, state != "free-living")$transfer_rate)
## [1] 0.050599



wrapping up Fig. 2 - losses:
Fig2tmp <- plot_grid(Fig2b,Fig2c,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("B","C"))

Fig2 <- plot_grid(Fig2a,Fig2tmp,
                     ncol = 2,
                     axis = "tr", 
                     labels = c("A",""))

ggsave("Fig2.pdf", Fig2, width = 12, height = 8)

Fig2



wrapping up Fig. 3 - transfer and duplication:
Fig3tmp1 <- plot_grid(Fig3a,Fig3c,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("A","C"))

Fig3tmp2 <- plot_grid(Fig3b,Fig3d,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("B","D"))

Fig3 <- plot_grid(Fig3tmp1,Fig3tmp2,
                  ncol = 2,
                  axis = "lr", 
                  align = "hv")

ggsave("Fig3.pdf", Fig3, width = 12, height = 12)

Fig3



Fig. S4 - DTL disparity across different endosymbiont lineages and association strength:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

all_branches_events$strength[all_branches_events$strength == "0"] <- "free-living"
all_branches_events$strength[all_branches_events$strength == "1"] <- "strict"
all_branches_events$strength[all_branches_events$strength == "2"] <- "loose"
all_branches_events$strength[all_branches_events$strength == "1/2"] <- "mixed"

levels(state) <- c("free-living",
                   "strict",
                   "loose",
                   "mixed")

cols<-setNames(c("#c0ecf5","orange","orange","orange"),levels(state))

median(subset(all_branches_events, strength == "free-living")$loss_rate)
## [1] 0.0883924
sd(subset(all_branches_events, strength == "free-living")$loss_rate)
## [1] 0.1119016
median(subset(all_branches_events, strength != "free-living")$loss_rate)
## [1] 0.133066
sd(subset(all_branches_events, strength != "free-living")$loss_rate)
## [1] 0.1112036
median(subset(all_branches_events, strength == "strict")$loss_rate)
## [1] 0.1550725
sd(subset(all_branches_events, strength == "strict")$loss_rate)
## [1] 0.1208997
median(subset(all_branches_events, strength == "mixed")$loss_rate)
## [1] 0.120976
sd(subset(all_branches_events, strength == "mixed")$loss_rate)
## [1] 0.05203798
median(subset(all_branches_events, strength == "loose")$loss_rate)
## [1] 0.160289
sd(subset(all_branches_events, strength == "loose")$loss_rate)
## [1] 0.203509
FigS4a <- ggplot(all_branches_events, aes(y=loss_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$loss_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip() 

median(subset(all_branches_events, strength == "free-living")$transfer_rate)
## [1] 0.0890355
sd(subset(all_branches_events, strength == "free-living")$transfer_rate)
## [1] 0.127596
median(subset(all_branches_events, strength != "free-living")$transfer_rate)
## [1] 0.0848525
sd(subset(all_branches_events, strength != "free-living")$transfer_rate)
## [1] 0.0486793
median(subset(all_branches_events, strength == "strict")$transfer_rate)
## [1] 0.0949539
sd(subset(all_branches_events, strength == "strict")$transfer_rate)
## [1] 0.05277865
median(subset(all_branches_events, strength == "mixed")$transfer_rate)
## [1] 0.0844569
sd(subset(all_branches_events, strength == "mixed")$transfer_rate)
## [1] 0.03892602
median(subset(all_branches_events, strength == "loose")$transfer_rate)
## [1] 0.0451819
sd(subset(all_branches_events, strength == "loose")$transfer_rate)
## [1] 0.06546718
FigS4b <- ggplot(all_branches_events, aes(y=transfer_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$transfer_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip()

median(subset(all_branches_events, strength == "free-living")$duplication_rate)
## [1] 0.0037185
sd(subset(all_branches_events, strength == "free-living")$duplication_rate)
## [1] 0.1326349
median(subset(all_branches_events, strength != "free-living")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, strength != "free-living")$duplication_rate)
## [1] 0.1557959
median(subset(all_branches_events, strength == "strict")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, strength == "strict")$duplication_rate)
## [1] 0.00889187
median(subset(all_branches_events, strength == "mixed")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, strength == "mixed")$duplication_rate)
## [1] 0.1874086
median(subset(all_branches_events, strength == "loose")$duplication_rate)
## [1] 0.02718265
sd(subset(all_branches_events, strength == "loose")$duplication_rate)
## [1] 0.1967789
FigS4c <- ggplot(all_branches_events, aes(y=duplication_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates (square root transformed)") + theme(plot.title = element_text(size = 10)) + scale_y_continuous(trans='sqrt') + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$duplication_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip()
all_branches_events <- data %>% filter(! str_detect(node, "root"))

all_branches_events["origin"][all_branches_events["origin"] == "EST"] <- "E"
all_branches_events["state"][all_branches_events["state"] == "EST"] <- "E"
all_branches_events <- drop_na(all_branches_events)

FigS4d <- ggplot(all_branches_events, aes(y=loss_rate, x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=median(all_branches_events$loss_rate[all_branches_events$origin == "0"]), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene loss rate")

FigS4e <- ggplot(all_branches_events, aes(y=transfer_rate, x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=median(all_branches_events$transfer_rate[all_branches_events$origin == "0"]), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene transfer rate")

FigS4f <- ggplot(all_branches_events, aes(y=sqrt(duplication_rate), x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=sqrt(median(all_branches_events$duplication_rate[all_branches_events$origin == "0"])), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene duplication rate (square root transformed)")
FigS4 <- plot_grid(FigS4a,
                   FigS4b,
                   FigS4c,
                   FigS4d,
                   FigS4e,
                   FigS4f,
                   nrow = 2,
                   axis = "lr", 
                   align = "hv",
                   labels = c("A","B","C","D","E","F"))

ggsave("FigS4.pdf", FigS4, width = 18, height = 12)

FigS4



data wrangling pt. 2a:
tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

ancs <- vector()  # genes present in Enterobacterales ancestor 
endo <- vector()  # genes present in at least an endosymbiont clade MRCA
aftr <- vector()  # genes originated or acquired in Enterobacterales after any shift to endosymbiosis  
nevr <- vector()  # genes which where originated or acquired in free-living only

ubiq <- vector()  # genes present in all Enterobacterales spp.
up50 <- vector()  # genes present in more than 50% of Endosmybionts spp.
dn50 <- vector()  # genes present in less than 50% of Endosmybionts spp.
lost <- vector()  # genes absent in Endosmybionts spp.

j=1

  for (i in files) {
    #i="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts//OG???_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt"
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",i)
    OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts//","",OG)
    names(df2)=c("node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="node")
    if (sum(length(df2[df2$node == "s3",]$P)) > 0) {
      ancs[[j]] <- OG
    }
    if (sum(as.numeric(df2[df2$state == "EST",]$L)) > 0) {
      endo[[j]] <- OG
      df2 <- df2[with(df2, ! grepl("^s", node)),]
      df2 <- subset(df2, state == "E" |  state == "EST")
      df2 <- subset(df2, P == 1)
      if (length(df2$node) == 75) {
        ubiq[[j]] <- OG
        } else if (length(df2$node) > 36) {
          up50[[j]] <- OG
          } else if (length(df2$node) > 0) {
            dn50[[j]] <- OG
            } else if (length(df2$node) == 0) {
              lost[[j]] <- OG
           }
      } else {
        df2 <- df2[with(df2, ! grepl("^s", node)),]
        if (sum(length(df2[df2$state == "E" | df2$state == "EST",]$P)) > 0) {
          aftr[[j]] <- OG
        } else {
          nevr[[j]] <- OG
    }
      }
    j = j + 1
  }

length(files)         # Among all genes
## [1] 10152
length(na.omit(endo)) # genes present in at least one free-livinig MRCA of an endosymbiotic clade
## [1] 4241
length(na.omit(aftr)) # genes originated or acquired in endosmybionts after any shift to endosymbiosis  
## [1] 1994
length(na.omit(nevr)) # genes which where originated or acquired in free-living only
## [1] 3917
# Among genes which were present in at least one free-living MRCA of an endosymbiotic clade
length(na.omit(ubiq)) # genes present in all endosmybiont spp.
## [1] 39
length(na.omit(up50)) # genes present in more than 50% of endosmybionts spp.
## [1] 413
length(na.omit(dn50)) # genes present in less than 50% of endosmybionts spp.
## [1] 2684
length(na.omit(lost)) # genes lost in all endosmybionts spp.
## [1] 1105



data wrangling pt. 3a:
files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

tokeep$Origin[tokeep$Trait != "N"] <- "endosymbiont"
tokeep$Origin[tokeep$Trait == "N"] <- "free-living"

for (k in c("free-living","endosymbiont")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "NA"
    }
      j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(`df_free-living`,df_endosymbiont)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)
df$mean_freq <- as.numeric(df$mean_freq)



Fig. S5 - families mean loss frequency delta across functional categories:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)
cog <- read.table("cog.tsv", sep="\t", header = T)
clean <- left_join(clean, cog, by = "OG")
clean <- clean %>% separate_rows(cog, sep = '(?<=.)(?=.)')
cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
clean <- left_join(clean, cog_conversion, by = "cog")

clean <- subset(clean, cog != "A")
clean <- subset(clean, cog != "B")
clean <- subset(clean, cog != "W")
clean <- subset(clean, cog != "Z")
clean <- subset(clean, cog != "na")
clean <- subset(clean, cog != "-")

clean <- clean[clean$OG %in% endo,]  

delta <- clean %>% group_by(cog, origin) %>% summarise(median_freq=median(mean_freq))
delta <- as.data.frame(na.omit(delta))
delta <- reshape(delta, idvar = "cog", timevar = "origin", direction = "wide")
delta$delta <- (delta$median_freq.endosymbiont - delta$`median_freq.free-living`)
delta <- left_join(delta, cog_conversion, by = "cog")
delta <- delta %>% arrange(delta)
delta <- delta[order(delta$delta, decreasing = TRUE), ]

delta
##    cog median_freq.endosymbiont median_freq.free-living      delta
## 19   Q                0.5000000              0.08333333 0.41666667
## 18   I                0.3969697              0.06189046 0.33507924
## 17   P                0.3964286              0.06443032 0.33199825
## 16   T                0.3850160              0.06577968 0.31923634
## 15   G                0.4045455              0.08731323 0.31723222
## 14   K                0.3783784              0.08661417 0.29176421
## 13   S                0.3571429              0.06593674 0.29120612
## 12   E                0.3333333              0.06185567 0.27147766
## 11   V                0.3284314              0.06957995 0.25885143
## 10   C                0.2817460              0.05635889 0.22538714
## 9    M                0.2714988              0.04919355 0.22230522
## 8    U                0.2752525              0.05719639 0.21805614
## 7    N                0.2944820              0.08417269 0.21030929
## 6    L                0.2353267              0.03901555 0.19631114
## 5    H                0.2307692              0.03922862 0.19154061
## 4    O                0.2385013              0.05584768 0.18265362
## 3    F                0.2183099              0.04562044 0.17268942
## 2    D                0.1760155              0.03419347 0.14182200
## 1    J                0.1127098              0.05000000 0.06270983
##                                                      description
## 19  Secondary metabolites biosynthesis, transport and catabolism
## 18                                Lipid transport and metabolism
## 17                        Inorganic ion transport and metabolism
## 16                                Signal transduction mechanisms
## 15                         Carbohydrate transport and metabolism
## 14                                                 Transcription
## 13                                              Function unknown
## 12                           Amino acid transport and metabolism
## 11                                            Defense mechanisms
## 10                              Energy production and conversion
## 9                         Cell wall/membrane/envelope biogenesis
## 8  Intracellular trafficking, secretion, and vesicular transport
## 7                                                  Cell motility
## 6                          Replication, recombination and repair
## 5                              Coenzyme transport and metabolism
## 4   Posttranslational modification, protein turnover, chaperones
## 3                            Nucleotide transport and metabolism
## 2     Cell cycle control, cell division, chromosome partitioning
## 1                Translation, ribosomal structure and biogenesis
order <- delta$description

FigS5 <- ggplot(data=clean, aes(x = factor(description, order), y=mean_freq, fill=origin)) + 
  geom_boxplot(outlier.shape = NA, position = position_dodge(1) ) +
  coord_flip() + theme_bw() + ylim(0, 1.2) +
  stat_compare_means(aes(label = after_stat(p.signif)), size=5, label.y = 0.75, label.x = 10, paired = T) +
  scale_fill_manual(values=c("orange", "#c0ecf5")) + 
  xlab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency") + ylim(0,1)

ggsave("FigS5.pdf", FigS5, width = 12, height = 12)

FigS5



Fig. 4c - families mean loss frequency correlate with free-living selection regime:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free < 1)

clean <- clean[clean$OG %in% endo,]  

dim(clean) # number of OGs considered in this analysis
## [1] 4143    3
Fig4c <- ggplot(clean, aes(x=dnds_free, y=mean_loss_freq_endo)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  theme_bw() +
  xlab("dNdS in free-living") + 
  ylab("mean loss frequency in endosymbionts") +
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1) +
  theme(legend.position = "none") + xlim(0,.75) +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(aspect.ratio = 1)
  

Fig4c <- Fig4c + stat_cor(method = "spearman", cex = 5)

ggsave("Fig4c.pdf", Fig4c, width = 7, height = 7)

Fig4c



code for dNdS inferences:

Nucleotide OGs were reconstructed from amino acids OGs and subsequently splitted in endosymbiont and free-living. Subsequently alignment was performed using a custom script and trimmining was carried out using trimal; genetrees were inferred using the pargenes pipeline. Here, “\(genefam_AA" point to the directory where the OGs analyzed previously using Generax are found, while "\)genefam_NT” points to the directory where the OGs used for dNdS will be found. Instead, “$genomes_CDS_NT” points to the directory were each genome coding sequences are found. Finally, codeml was run using the alignement and associated genetree file.


for j in "$genefam_AA"; do for i in $(grep ">" $j); do sp=$(echo $i | awk -F "%" '{print $1}' | tr -d ">" |  awk -F "." '{print $1}'); gene=$(echo $i | awk -F "%" '{print $3}' | awk -F "_" 'NF{NF-=1};1' | sed 's/ /_/g'); grep -A 1 $gene "$genomes_CDS_NT"/$sp* >> "$genefam_NT"/${j::-14}.clean.nt; done; done

for i in *.clean.nt; do for j in $(grep $'\tN\t' sp_annotations | awk '{print $1}'); do grep -A 1 $j $i >> ${i::-3}.free.clean.nt; done; done

for i in *.clean.nt; do for j in $(grep $'\tE\t' sp_annotations | awk '{print $1}'); do grep -A 1 $j $i >> ${i::-3}.endo.clean.nt; done; done

sh msa_align.sh -i input_ogs -f output_aln

for i in *.aln; do trimal -in "$i" -out "$i".gappyout -gappyout; done 

for i in *gappyout; do trimal -in "$i" -out "$i"_NoSp -resoverlap 0.6 -seqoverlap 50; done

pargenes.py -a alignements -o output -d nt --use-modeltest



Fig. S8a - families mean loss frequency and free-living selective regime - boxplot:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free < 1)

clean['state'] <- "NA"

clean[clean$OG %in% ubiq,]$state <- "ubiquitous" 
clean[clean$OG %in% up50,]$state <- "more than 50% spp." 
clean[clean$OG %in% dn50,]$state <- "less than 50% spp." 
clean[clean$OG %in% lost,]$state <- "allways lost" 

clean <- subset(clean, state != "NA")

my_comparisons <- list( 
  c("ubiquitous", "more than 50% spp."), c("ubiquitous", "less than 50% spp."), c("ubiquitous", "allways lost"),
  c("allways lost", "more than 50% spp."), c("allways lost", "less than 50% spp."), 
  c("more than 50% spp.", "less than 50% spp.")
  )

clean <-subset(clean, dnds_free < 1)

FigS8a <- ggplot(clean, aes(y=dnds_free, x=state)) + 
  geom_boxplot(width=0.2, outlier.shape = NA, fill="orange") + 
  theme_bw() +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  labs(y = "dNdS in free-living species") + labs(x = "gene present in at least one free-living MRCA of an endosymbiotic clade") + 
  stat_compare_means(comparisons = my_comparisons, method="wilcox.test") 



Fig. S7 - selection regime shift is consistent across different functional categories:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, endo!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)

clean$delta <- clean$endo - clean$free
clean <- subset(clean, delta!="NA")

cog <- read.table("cog.tsv", sep="\t", header = T)
cog_clean <- left_join(clean, cog, by = "OG")
cog_clean <- cog_clean %>% mutate(PCT = ntile(delta, 1000))  %>%  data.frame()  

cog_clean <- cog_clean %>%separate_rows(cog, sep = '(?<=.)(?=.)')

cog_clean <- subset(cog_clean, cog != "S")
cog_clean <- subset(cog_clean, cog != "NA")
cog_clean <- subset(cog_clean, cog != "-")
cog_clean <- subset(cog_clean, cog != "B")  # too few genes for the cog "Chromatin structure and dynamics" just has one genes
cog_clean <- subset(cog_clean, cog != "W")  # too few genes for the cog "Extracellular structures"

cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
cog_clean <- left_join(cog_clean, cog_conversion, by = "cog")

median(cog_clean$delta)
## [1] 0.07677
sd(cog_clean$delta)
## [1] 0.08757883
FigS7 <- cog_clean %>% 
  group_by(description) %>%
  mutate(mdelta = median(delta)) %>%
  subset(origin == "free-living")  %>%
  ggplot(aes(delta)) +
  geom_density(aes(y=..count..)) +
  theme_bw() + 
  geom_vline(aes(xintercept= 0), colour='black', linetype="dashed") +
  geom_vline(aes(xintercept= mdelta), colour='orange', size=0.5) +
  facet_wrap(~ description, nrow=3) +
  theme(strip.text = element_text(face = "bold", size = rel(.4)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(x = "delta dNdS" ) + labs(y = "number of genes")


ggsave("FigS7.pdf", FigS7, width = 12, height = 6)

FigS7



data wrangling pt. 4a:
files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

for (k in c("A","B","C","D","E1","E2","F","G","H","I","L","0")){
  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "NA"
    }
    j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(df_A,df_B,df_C,df_D,df_E1,df_E2,df_F,df_G,df_H,df_I,df_L,df_0)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)



Fig. S6 - functional categories ranking for families mean loss frequency:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean <- subset(clean, dnds != "NA")
clean <- subset(clean, mean_freq != "NA")
clean <- subset(clean, mean_freq != "NaN")
clean <- subset(clean, free != "NaN")
clean <- subset(clean, endo != "NaN")

cog <- read.table("cog.tsv", sep="\t", header = T)
clean_endo <- left_join(clean, cog, by = "OG")

clean_endo <- subset(clean_endo, cog != "NA")
clean_endo <- subset(clean_endo, cog != "-")

clean_endo <- clean_endo %>%separate_rows(cog, sep = '(?<=.)(?=.)')

clean_endo <- subset(clean_endo, cog != "A")
clean_endo <- subset(clean_endo, cog != "B")
clean_endo <- subset(clean_endo, cog != "W")
clean_endo <- subset(clean_endo, cog != "Z")

clean_endo <- left_join(clean_endo, cog_conversion, by = "cog")

clean_endo$mean_freq <- as.numeric(clean_endo$mean_freq)

clean_endo <- clean_endo[clean_endo$OG %in% endo,]  

A <- subset(clean_endo, origin == "A" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
A$rank <- rownames(A)
A$origin <- "A"

B <- subset(clean_endo, origin == "B" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
B$rank <- rownames(B)
B$origin <- "B"

C <- subset(clean_endo, origin == "C" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
C$rank <- rownames(C)
C$origin <- "C"

D <- subset(clean_endo, origin == "D" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
D$rank <- rownames(D)
D$origin <- "D"

E1 <- subset(clean_endo, origin == "E1" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E1$rank <- rownames(E1)
E1$origin <- "E1"

E2 <- subset(clean_endo, origin == "E2" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E2$rank <- rownames(E2)
E2$origin <- "E2"

effe <- subset(clean_endo, origin == "F" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
effe$rank <- rownames(effe)
effe$origin <- "effe"

G <- subset(clean_endo, origin == "G" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
G$rank <- rownames(G)
G$origin <- "G"

H <- subset(clean_endo, origin == "H" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
H$rank <- rownames(H)
H$origin <- "H"

I <- subset(clean_endo, origin == "I" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
I$rank <- rownames(I)
I$origin <- "I"

L <- subset(clean_endo, origin == "L" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
L$rank <- rownames(L)
L$origin <- "L"

zero <- subset(clean_endo, origin == "0" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
zero$rank <- rownames(zero)
zero$origin <- "zero"

loss_freq <- dplyr::bind_rows(A,B,C,D,E1,E2,effe,G,H,I,L,zero)
loss_freq$freq <- as.numeric(loss_freq$freq)
loss_freq$rank <- as.numeric(loss_freq$rank)

loss_freq$origin[loss_freq$origin == "effe"] <- "F"
loss_freq$origin[loss_freq$origin == "zero"] <- "free"

loss_freq <- loss_freq %>%  mutate(mycolor = ifelse(origin == "free", "#c0ecf5", "orange"))

order <- aggregate(rank~description,loss_freq,sum)
order <- order[order(order$rank),]$description

FigS6 <- ggplot(loss_freq, aes(x = factor(description, order), y=(as.numeric(rank)), label=origin, fill=mycolor)) + 
  geom_point() + 
  geom_label_repel(max.overlaps = Inf, min.segment.length = 0) + scale_fill_identity(c("#c0ecf5", "orange")) +
  theme_bw() + coord_flip() +
  xlab("") + ylab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 12)) + theme(axis.text.x = element_text(size = 12)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency ranks")

ggsave("FigS6.pdf", FigS6, width = 12, height = 12)

FigS6



Fig. 4b - functional categories ranking for families mean loss frequency - wordcloud:
loss_freq_endo <- subset(loss_freq, mycolor == 'orange')[,c(1,3,4)]
loss_freq_endo <- aggregate(rank~description,loss_freq_endo,sum)

loss_freq_endo$description[loss_freq_endo$description == "Secondary metabolites biosynthesis, transport and catabolism"] <- "Secondary metabolites"
loss_freq_endo$description[loss_freq_endo$description == "Intracellular trafficking, secretion, and vesicular transport"] <- "Intracellular trafficking"
loss_freq_endo$description[loss_freq_endo$description == "Secondary metabolites biosynthesis, transport and catabolism"] <- "Secondary metabolites"

set.seed(sample(1:100,1))

Fig4b <- ggplot(loss_freq_endo, aes(label = description, size = rank, color = rank)) +
  geom_text_wordcloud(shape = "square", grid_size = 0, rm_outside = F, eccentricity = 2) +
  scale_size_area(max_size = 10) +
  theme_void() +
  scale_color_gradient(low = "darkgray", high = "orange") +
  theme(aspect.ratio = 1)

ggsave("Fig4b.pdf", Fig4b, width = 12, height = 12)

Fig4b



Fig. 4a - families mean loss frequency correlation across different endosymbiont clades:
df_cor <- data.frame(clean_endo[,c(-4,-5,-6,-7,-8,-9)])
df_cor <- reshape(df_cor, idvar = "OG", timevar = "origin", direction = "wide")
row.names(df_cor) <- df_cor$OG
df_cor <- df_cor[,c(-1)]

new_order <-  sort(colnames(df_cor),decreasing=F)
df_cor <- df_cor[, new_order]

colnames(df_cor) <- c("free-living","A","B","C","D","E1","E2","F","G","H","I","L")
my.palette <- get_palette(c("white", "white", "orange"), 10)
cor.test <- df_cor %>% cor_test(method = "spearman", use='pairwise.complete.obs')
cor.test$p <- p.adjust(cor.test$p, method = "hochberg", n = length(cor.test$p))
cor.mat <- as_cor_mat(cor.test)
p.mat <- cor_pmat(df_cor)

Fig4a <- ggcorrplot(cor.mat, p.mat = p.mat, sig.level = 0.001, hc.order = F, type = "lower", outline.col = "white", ggtheme = ggplot2::theme_bw,
           colors = c("white", "white", "orange"),method = "circle") + 
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14))

ggsave("Fig4a.pdf", Fig4a, width = 7, height = 7)

Fig4a



Fig. S8b - families mean loss frequency correlation with free-living selection regime (across different endosymbiont lineages + free-living):
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

clean <- subset(clean, mean_freq != "NA")
clean$mean_freq <- as.numeric(clean$mean_freq)

clean$origin[clean$origin == 0] <- 'free-living'

clean <- clean[clean$OG %in% endo,]  

FigS8b <- ggplot(clean, aes(x=free, y=mean_freq)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  facet_wrap( ~ origin) +
  stat_cor(method = "spearman", label.y = 0.75, size = 2) +
  theme_bw() +
  xlab("dNdS in free-living species") + 
  ylab("gene loss rate") +
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio = 1) +
  facet_wrap( ~ origin) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1)



Fig. S8:
FigS8 <- plot_grid(FigS8a,FigS8b,
                  axis = "l", 
                  align = "hv",
                  labels = c("A","B"))

ggsave("FigS8.pdf", FigS8, width = 12, height = 6)

FigS8



linearize the tree:
tree <- read.tree("DLT_analyses_SpeciesRax_all_genes_SR4/DLT_tree_conv.newick")
tree <- drop.tip(tree, "NZ-LG025083.1")
tree <- drop.tip(tree,c("NZ-CP028926.1","NZ-CP009610.1"))           # drop outgroups
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))     # reroot on Plesiomonas shigelloides
tree$tip.label <- gsub("\\.[0-9]", "",tree$tip.label )

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

#info <- info[!info$species.y == "NZ-CP028926", ]
#info <- info[!info$species.y == "NZ-CP009610", ]

treec1 <- chronos(tree, model = "correlated", lambda = 1)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -45.7859 
## Optimising rates... dates... -45.7859 
## Optimising rates... dates... -45.78589 
## 
## log-Lik = -45.75865 
## PHIIC = 1336.63
lnLt1 <- attr(treec1, "ploglik")
treec2 <- chronos(tree, model = "correlated", lambda = 2)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.01406 
## Optimising rates... dates... -46.01406 
## Optimising rates... dates... -46.01406 
## 
## log-Lik = -46.10371 
## PHIIC = 1338.85
lnLt2 <- attr(treec2, "ploglik")
treec3 <- chronos(tree, model = "correlated", lambda = 3)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.24338 
## Optimising rates... dates... -46.24338 
## Optimising rates... dates... -46.24328 
## 
## log-Lik = -46.32981 
## PHIIC = 1339.56
lnLt3 <- attr(treec3, "ploglik")
treec4 <- chronos(tree, model = "correlated", lambda = 4)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.39016 
## Optimising rates... dates... -46.39016 
## Optimising rates... dates... -46.38921 
## 
## log-Lik = -46.50299 
## PHIIC = 1340.41
lnLt4 <- attr(treec4, "ploglik")
treec5 <- chronos(tree, model = "correlated", lambda = 5)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.47846 
## Optimising rates... dates... -46.47846 
## Optimising rates... dates... -46.47368 
## 
## log-Lik = -46.59152 
## PHIIC = 1341.06
lnLt5 <- attr(treec5, "ploglik")
treec6 <- chronos(tree, model = "correlated", lambda = 6)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.69738 
## Optimising rates... dates... -46.69738 
## Optimising rates... dates... -46.69172 
## 
## log-Lik = -46.81291 
## PHIIC = 1340.76
lnLt6 <- attr(treec6, "ploglik")
treec7 <- chronos(tree, model = "correlated", lambda = 7)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.64066 
## Optimising rates... dates... -46.64066 
## Optimising rates... dates... -46.6327 
## 
## log-Lik = -46.80559 
## PHIIC = 1341.94
lnLt7 <- attr(treec7, "ploglik")
treec8 <- chronos(tree, model = "correlated", lambda = 8)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.94669 
## Optimising rates... dates... -46.94669 
## Optimising rates... dates... -46.94262 
## 
## log-Lik = -47.0448 
## PHIIC = 1341.35
lnLt8 <- attr(treec8, "ploglik")
treec9 <- chronos(tree, model = "correlated", lambda = 9)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.74922 
## Optimising rates... dates... -46.74922 
## Optimising rates... dates... -46.74632 
## 
## log-Lik = -46.94889 
## PHIIC = 1342.23
lnLt9 <- attr(treec9, "ploglik")
treec10 <- chronos(tree, model = "correlated", lambda = 10)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.58149 
## Optimising rates... dates... -46.58149 
## Optimising rates... dates... -46.55624 
## 
## log-Lik = -46.73676 
## PHIIC = 1345.25
lnLt10 <- attr(treec10, "ploglik")

lnLt <- c(lnLt1,lnLt2,lnLt3,lnLt4,lnLt5,lnLt6,lnLt7,lnLt8,lnLt9,lnLt10)

plot(lnLt)

treec <- chronos(tree, model = "correlated", lambda = 1)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -45.7859 
## Optimising rates... dates... -45.7859 
## Optimising rates... dates... -45.78589 
## 
## log-Lik = -45.75865 
## PHIIC = 1336.63
info <- info[match(treec$tip.label, info$species.y),]

treec$tip.label <- info$sp



Fig. 1 - endosymbiosis ancestral state reconstruction:
state <- setNames(as.factor(info$endosymbiont),info$sp)

DOLLO <- matrix(c(0,1,0,0),2,2)

fitARD<-ace(state,treec, model=DOLLO, type="discrete")

cols<-setNames(c("orange","turquoise"),levels(state))

plot.phylo(treec,type="fan", cex=0.45, lwd=1, label.offset=0.025)

tiplabels(pie=to.matrix(state[treec$tip.label],levels(state)),piecol=cols,cex=0.1)

nodelabels(node=1:treec$Nnode+Ntip(treec),
           pie=fitARD$lik.anc,piecol=cols,cex=0.15)

add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
                  y=0.8*par()$usr[3],fsize=0.8)

pdf(file="Fig1.pdf", width=15, height=15)

tree <- beast_tree
origina_annot <- read.table("phylogenetic_resolutions/list_of_origins.tsv", sep="\t", header=T)

tmp <- data
tmp$species <- gsub("\\.[0-9]", "",tmp$node )
tmp  <- left_join(tmp, origina_annot, by = 'species')
origina_annot  <- tmp[,c(1,14,15,16,17,18,19,20,21,22,23,24)]

wow <- origina_annot
wow <- wow[,c(1,3)]
colnames(wow) <- c("id","species")

p <- ggtree(tree, layout='circular', branch.length='none')  %<+% wow + geom_tiplab(aes(label=species), offset = 7, size=3,fontface = "italic") +
  theme(legend.position = "none")

p <- p + xlim(NA, 31)

p <- p + scale_color_manual(values=c("black", "white")) + geom_tippoint(aes(color=state)) + scale_color_manual(values=c("orange","orange","#c0ecf5")) +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")
 
rownames(origina_annot) <- origina_annot$node
origina_annot <- origina_annot[,c(5,6,7,8,9,10,11,12)]

gheatmap(p, origina_annot, offset=0.5, width=0.25, hjust=0, colnames_style(color="white"),
         colnames=FALSE, colnames_offset_y = 8) + 
  theme(legend.position = 'none') +
  scale_fill_manual(values=c("white","#fd7f6f", "#7eb0d5", "#b2e061", "#bd7ebe", "#ffd200", "#e69138", "#beb9db", "#fdcce5", "#8bd3c7", "lightgray","darkgray"))

dev.off() 
## quartz_off_screen 
##                 2



TESTING AN ALTERNATE TOPOLOGY:

Here we perform a large subset of analyses on an alternate topology, as a sensitivity test on how the phylogenetic resolution can impact our findings.



data wrangling pt. 1b:
node_annotation = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)   #Path to node annotation

rates = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/rates_conv.txt",sep=" ",header=F)             #Path to per_species_rates
colnames(rates) <- c("node","duplication_rate","loss_rate","transfer_rate")
counts = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/counts_conv.txt",sep=" ",header=F)          #Path to per_species_counts
colnames(counts) <- c("node","S","SL","D","T","TL")
counts$S <- gsub("S=", "", counts$S)
counts$SL <- gsub("SL=", "", counts$SL)
counts$D <- gsub("D=", "", counts$D)
counts$T <- gsub("T=", "", counts$T)
counts$TL <- gsub("TL=", "", counts$TL)

tmp  <- left_join(node_annotation, rates, by = 'node')
data  <- left_join(tmp, counts, by = 'node')



Fig. S9a + S9b + S9c - DTL rates differ between symbiont and free-living branches:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

all_branches_events$state[all_branches_events$state == "E"] <- "endosymbiont"
all_branches_events$state[all_branches_events$state == "N"] <- "free-living"

state <- ""
levels(state) <- c("endosymbiont","free-living")
cols<-setNames(c("orange","#c0ecf5"),levels(state))

shapiro.test(all_branches_events$loss_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$loss_rate
## W = 0.90402, p-value = 2.361e-15
p <- ks.test(all_branches_events$loss_rate[all_branches_events$state == "endosymbiont"],all_branches_events$loss_rate[all_branches_events$state == "free-living"])$p.value
FigS9a <- ggplot(all_branches_events, aes(y=loss_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("loss rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

shapiro.test(all_branches_events$duplication_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$duplication_rate
## W = 0.26407, p-value < 2.2e-16
p <- ks.test(all_branches_events$duplication_rate[all_branches_events$state == "endosymbiont"],all_branches_events$duplication_rate[all_branches_events$state == "free-living"])$p.value
FigS9b <-  ggplot(all_branches_events, aes(y=duplication_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("duplication rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .05))

shapiro.test(all_branches_events$transfer_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$transfer_rate
## W = 0.81517, p-value < 2.2e-16
p <- ks.test(all_branches_events$transfer_rate[all_branches_events$state == "endosymbiont"],all_branches_events$transfer_rate[all_branches_events$state == "free-living"])$p.value
FigS9c <- ggplot(all_branches_events, aes(y=transfer_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("transfer rates p= ",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))



Fig. S9d - contingent gene losses overall:
timeline <-  read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

tmp <- timeline %>%
  mutate(name2=origin.x)

FigS9d <- ggplot(data=tmp, aes(x=coordinate, y=loss_rate, colour=origin)) + 
  xlab("") + theme_bw() + theme(text = element_text(size = 10)) + 
  geom_smooth(xintercept=-1, linetype="dashed", colour = "orange", size=1, alpha=0.2, colour=timeline$origin) +
  theme(
      legend.position="none",
      plot.title = element_text(size=14),
      panel.grid = element_blank()
    ) + theme_bw() + theme(axis.title.x=element_blank()) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
    theme(strip.background =element_rect(fill="white")) + 
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) +
  labs(y = "gene loss rate") + labs(x = "branch") + theme(plot.title = element_text(size = 10))



files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

ancs <- vector()  # genes present in Enterobacterales ancestor 
endo <- vector()  # genes present in at least an endosymbiont clade MRCA
aftr <- vector()  # genes originated or acquired in Enterobacterales after any shift to endosymbiosis  
nevr <- vector()  # genes which where originated or acquired in free-living only

ubiq <- vector()  # genes present in all Enterobacterales spp.
up50 <- vector()  # genes present in more than 50% of Endosmybionts spp.
dn50 <- vector()  # genes present in less than 50% of Endosmybionts spp.
lost <- vector()  # genes absent in Endosmybionts spp.

j=1

  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",i)
    OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts//","",OG)
    names(df2)=c("node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="node")
    if (sum(length(df2[df2$node == "s3",]$P)) > 0) {
      ancs[[j]] <- OG
    }
    if (sum(as.numeric(df2[df2$state == "EST",]$L)) > 0) {
      endo[[j]] <- OG
      df2 <- df2[with(df2, ! grepl("^s", node)),]
      df2 <- subset(df2, state == "E" |  state == "EST")
      df2 <- subset(df2, P == 1)
      if (length(df2$node) == 75) {
        ubiq[[j]] <- OG
        } else if (length(df2$node) > 36) {
          up50[[j]] <- OG
          } else if (length(df2$node) > 0) {
            dn50[[j]] <- OG
            } else if (length(df2$node) == 0) {
              lost[[j]] <- OG
           }
      } else {
        df2 <- df2[with(df2, ! grepl("^s", node)),]
        if (sum(length(df2[df2$state == "E" | df2$state == "EST",]$P)) > 0) {
          aftr[[j]] <- OG
        } else {
          nevr[[j]] <- OG
    }
      }
    j = j + 1
  }

length(files)         # Among all genes
## [1] 10152
length(na.omit(endo)) # genes present in at least one free-livinig MRCA of an endosymbiotic clade
## [1] 4476
length(na.omit(aftr)) # genes originated or acquired in endosmybionts after any shift to endosymbiosis  
## [1] 3081
length(na.omit(nevr)) # genes which where originated or acquired in free-living only
## [1] 2595
# Among genes which were present in at least one free-living MRCA of an endosymbiotic clade
length(na.omit(ubiq)) # genes present in all endosmybiont spp.
## [1] 6
length(na.omit(up50)) # genes present in more than 50% of endosmybionts spp.
## [1] 668
length(na.omit(dn50)) # genes present in less than 50% of endosmybionts spp.
## [1] 2946
length(na.omit(lost)) # genes lost in all endosmybionts spp.
## [1] 856
data wrangling pt. 3b:
files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

tokeep$Origin[tokeep$Trait != "N"] <- "endosymbiont"
tokeep$Origin[tokeep$Trait == "N"] <- "free-living"

for (k in c("free-living","endosymbiont")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "1"
    }
      j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(`df_free-living`,`df_endosymbiont`)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)
df$mean_freq <- as.numeric(df$mean_freq)

#

dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)
cog <- read.table("cog.tsv", sep="\t", header = T)
clean <- left_join(clean, cog, by = "OG")
clean <- clean %>% separate_rows(cog, sep = '(?<=.)(?=.)')
cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
clean <- left_join(clean, cog_conversion, by = "cog")



Fig. S9f:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, mean_loss_freq_endo != "NA")
clean <- subset(clean, mean_loss_freq_endo != "NaN")
clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free != "NaN")
clean <- subset(clean, dnds_free < 1)

clean <- clean[clean$OG %in% endo,]  

dim(clean) # number of OGs considered in this analysis
## [1] 4368    3
FigS9f <- ggplot(clean, aes(x=dnds_free, y=mean_loss_freq_endo)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  theme_bw() +
  xlab("dNdS in free-living") + 
  ylab("mean loss frequency in endosymbionts") +
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank()) + 
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1) +
  theme(legend.position = "none") + xlim(0,.75) +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(aspect.ratio = 1)

FigS9f <- FigS9f + stat_cor(method = "spearman", cex = 5)



data wrangling pt. 4b:
files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

for (k in c("A","B","C","D","E","F1",'F2',"G","H","I","L","M","0")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "1"
    }
    j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(df_A,df_B,df_C,df_D,df_E,df_F1,df_F2,df_G,df_H,df_I,df_L,df_M,df_0)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)



Fig. S9g:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean <- subset(clean, dnds != "NA")
clean <- subset(clean, mean_freq != "NA")
clean <- subset(clean, mean_freq != "NaN")
clean <- subset(clean, free != "NaN")
clean <- subset(clean, endo != "NaN")

cog <- read.table("cog.tsv", sep="\t", header = T)
clean_endo <- left_join(clean, cog, by = "OG")

clean_endo <- subset(clean_endo, cog != "NA")
clean_endo <- subset(clean_endo, cog != "-")

clean_endo <- clean_endo %>%separate_rows(cog, sep = '(?<=.)(?=.)')

clean_endo <- subset(clean_endo, cog != "A")
clean_endo <- subset(clean_endo, cog != "B")
clean_endo <- subset(clean_endo, cog != "W")
clean_endo <- subset(clean_endo, cog != "Z")

clean_endo <- left_join(clean_endo, cog_conversion, by = "cog")

clean_endo$mean_freq <- as.numeric(clean_endo$mean_freq)

clean_endo <- clean_endo[clean_endo$OG %in% endo,]  

A <- subset(clean_endo, origin == "A" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
A$rank <- rownames(A)
A$origin <- "A"

B <- subset(clean_endo, origin == "B" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
B$rank <- rownames(B)
B$origin <- "B"

C <- subset(clean_endo, origin == "C" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
C$rank <- rownames(C)
C$origin <- "C"

D <- subset(clean_endo, origin == "D" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
D$rank <- rownames(D)
D$origin <- "D"

E1 <- subset(clean_endo, origin == "E1" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E1$rank <- rownames(E1)
E1$origin <- "E1"

E2 <- subset(clean_endo, origin == "E2" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E2$rank <- rownames(E2)
E2$origin <- "E2"

effe <- subset(clean_endo, origin == "F" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
effe$rank <- rownames(effe)
effe$origin <- "effe"

G <- subset(clean_endo, origin == "G" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
G$rank <- rownames(G)
G$origin <- "G"

H <- subset(clean_endo, origin == "H" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
H$rank <- rownames(H)
H$origin <- "H"

I <- subset(clean_endo, origin == "I" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
I$rank <- rownames(I)
I$origin <- "I"

L <- subset(clean_endo, origin == "L" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
L$rank <- rownames(L)
L$origin <- "L"

zero <- subset(clean_endo, origin == "0" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
zero$rank <- rownames(zero)
zero$origin <- "zero"

loss_freq <- dplyr::bind_rows(A,B,C,D,E1,E2,effe,G,H,I,L,zero)
loss_freq$freq <- as.numeric(loss_freq$freq)
loss_freq$rank <- as.numeric(loss_freq$rank)

loss_freq$origin[loss_freq$origin == "effe"] <- "F"
loss_freq$origin[loss_freq$origin == "zero"] <- "free"

loss_freq <- loss_freq %>%  mutate(mycolor = ifelse(origin == "free", "#c0ecf5", "orange"))

order <- aggregate(rank~description,loss_freq,sum)
order <- order[order(order$rank),]$description

FigS9g <- ggplot(loss_freq, aes(x = factor(description, order), y=(as.numeric(rank)), label=origin, fill=mycolor)) + 
  geom_point() + 
  geom_label_repel(max.overlaps = Inf, min.segment.length = 0) + scale_fill_identity(c("#c0ecf5", "orange")) +
  theme_bw() + 
  #coord_flip() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("") + ylab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 12)) + theme(axis.text.x = element_text(size = 12)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency ranks")



Fig. S9e:
df_cor <- data.frame(clean_endo[,c(-4,-5,-6,-7,-8,-9)])
df_cor <- reshape(df_cor, idvar = "OG", timevar = "origin", direction = "wide")
row.names(df_cor) <- df_cor$OG
df_cor <- df_cor[,c(-1)]

new_order <-  sort(colnames(df_cor),decreasing=F)
df_cor <- df_cor[, new_order]

colnames(df_cor) <- c("free-living","A","B","C","D","E","F1","F2","G","H","I","L","M")
my.palette <- get_palette(c("white", "white", "orange"), 10)
cor.test <- df_cor %>% cor_test(method = "spearman", use='pairwise.complete.obs')
cor.test$p <- p.adjust(cor.test$p, method = "hochberg", n = length(cor.test$p))
cor.mat <- as_cor_mat(cor.test)
p.mat <- cor_pmat(df_cor)

FigS9e <- ggcorrplot(cor.mat, p.mat = p.mat, sig.level = 0.001, hc.order = F, type = "lower", outline.col = "white", ggtheme = ggplot2::theme_bw,
           colors = c("white", "white", "orange"),method = "circle") + 
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(legend.position = "none") 



wrapping up Fig. S9 - topology sensitivity test:
FigS9tmp <- (FigS9a | FigS9b | FigS9c) /
            (FigS9d | FigS9e | FigS9f) /
                      FigS9g

FigS9 <- FigS9tmp + plot_annotation(tag_levels = 'A')

ggsave("FigS9.pdf", FigS9, width = 12, height = 16)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
FigS9
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'



wrapping up - save workspace:
save.image(file="workspace.RData")